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Abstract 



In this paper we first study a smooth optimization approach for solving a class of 
non-smooth strictly concave maximization problems whose objective functions admit 
smooth convex minimization reformulations. In particular, we apply Nesterov's smooth 
optimization technique |19l \^2T\ to their dual counterparts that are smooth convex prob- 
lems. It is shown that the resulting approach has 0(1/-^) iteration complexity for 
finding an e-optimal solution to both primal and dual problems. We then discuss the 
application of this approach to sparse covariance selection that is approximately solved 
as a /i-norm penalized maximum likelihood estimation problem, and also propose a 
variant of this approach which has substantially outperformed the latter one in our com- 
putational experiments. We finally compare the performance of these approaches with 
other first-order methods, namely, Nesterov's 0{l/e) smooth approximation scheme and 
block-coordinate descent method studied in [HI US] for sparse covariance selection on a 
set of randomly generated instances. It shows that our smooth optimization approach 
substantially outperforms the first method above, and moreover, its variant substantially 
outperforms both methods above. 

Key words: Covariance selection, non-smooth strictly concave maximization, smooth 
minimization 

AMS 2000 subject classification: 90C22, 90C25, 90C47, 65K05, 62J10 

1 Introduction 

In fl9[ |2T] . Nesterov proposed an efficient smooth optimization method for solving convex 
programming problems of the form 
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where / is a convex function with Lipschitz continuous gradient, and U is a closed convex 
set. It is shown that his method has 0{l/y/e) iteration complexity bound, where e > is the 
absolute precision of the final objective function value. A proximal-point-type algorithm for 
([T]) having the same complexity as above has also been proposed more recently by Auslender 
and TebouUe [2]. 

Motivated by [9J, we are particularly interested in studying the use of smooth optimization 
approach for solving a class of non-smooth strictly concave maximization problems whose 
objective functions admit smooth convex minimization reformulations in this paper. Our key 
idea is to apply Nesterov's smooth optimization technique [191 El] to their dual counterparts 
that are smooth convex problems. It is shown that the resulting approach has 0{l/^/e) 
iteration complexity for finding an e-optimal solution to both primal and dual problems. 

One interesting application of the above approach is for sparse covariance selection. Given 
a set of random variables with Gaussian distribution for which the true covariance matrix is 
unknown, covariance selection is a procedure used to estimate true covariance from a sample 
covariance matrix by maximizing its likelihood while imposing a certain sparsity on the in- 
verse of the covariance estimation (e.g., see pj]). Therefore, it can be applied to determine a 
robust estimate of the true variance matrix, and simultaneously discover the sparse structure 
in the underlying model. Despite its popularity in numerous real- world applications (e.g., see 
[HISIES] and the references therein), sparse covariance selection itself is a challenging NP-hard 
combinatorial optimization problem. By an argument that is often used in regression tech- 
niques such as LASSO [23], Yuan and Lin [22] and d'Aspremont et al. [H] (see also [3]) showed 
that it can be approximately solved as a /i-norm penalized maximum likelihood estimation 
problem. Moreover, the authors of [9] studied two efficient first-order methods for solving 
this problem, that is, Nesterov's smooth approximation scheme and block-coordinate descent 
(BCD) method. It was shown in [S] that their first method has 0{l/e) iteration complexity 
for finding an e-optimal solution. For their second method, each iterate requires solving a box 
constrained quadratic programming, and it has a local linear convergence rate. However, its 
global iteration complexity for finding an e-optimal solution is theoretically unknown. After 
the first release of our paper, Friedman et al. [151 studied a slight variant of the BCD method 
proposed in [9j. At each iteration of their method, a coordinate descent approach is applied to 
solve a lasso (/i-regularized) least-squares problem, which is the dual of the box constrained 
quadratic programming appearing in the BCD method [9]. In contrast with these methods, 
the smooth optimization approach proposed in this paper has a more attractive iteration com- 
plexity that is 0{l/^/e) for finding an e-optimal solution. In addition, we propose a variant 
of the smooth optimization approach which has substantially outperformed the latter one in 
our computational experiments. We also compare the performance of our approaches with 
their methods for sparse covariance selection on a set of randomly generated instances. It 
shows that our smooth optimization approach substantially outperforms their first method 
above (i.e., Nesterov's smooth approximation scheme), and moreover, its variant substantially 
outperforms their methods [9], |15] mentioned above. 

The paper is organized as follows. In Section [21 we introduce a class of non-smooth 
concave maximization problems in which we are interested, and propose a smooth optimization 
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approach to them. In Section [3], we briefly introduce sparse covariance selection, and show 
that it can be approximately solved as a /i-norm penalized maximum likelihood estimation 
problem. We also discuss the application of the smooth optimization approach for solving this 
problem, and propose a variant of this approach. In Section HI we compare the performance of 
our smooth optimization approach and its variant with two other first-order methods studied 
in [9l [15] for sparse covariance selection on a set of randomly generated instances. Finally, we 
present some concluding remarks in Section [51 



1.1 Notation 

In this paper, all vector spaces are assumed to be finite dimensional. The space of symmetric 
n X n matrices will be denoted by 5". If X G 5" is positive semi definite, we write X ^ 0. 
Also, we write X to mean Y — X y 0. The cone of positive semidefinite (resp., definite) 
matrices is denoted by iS" (resp., i^"^). Given matrices X and Y in W^'^, the standard inner 
product is defined by {X,Y) := Tt{XY'^), where Tr(-) denotes the trace of a matrix. || ■ || 
denotes the Euclidean norm and its associated operator norm unless it is explicitly stated 
otherwise. The Frobenius norm of a real matrix X is defined as HXHi? := ^yTr(XX'^). We 
denote by e the vector of all ones, and by / the identity matrix. Their dimensions should be 
clear from the context. For a real matrix X, we denote by Card(X) the cardinality of X, that 
is, the number of nonzero entries of X, and denote by |X| the absolute value of X, that is, 
\X\ij = \Xij\ for all The determinant and the minimal (resp., maximal) eigenvalue of a 
real symmetric matrix X are denoted by detX and Amin(X) (resp., Ainax(-'^))5 respectively. 
For a n-dimensional vector w, diag(w) denote the diagonal matrix whose z-th diagonal element 
is Wi for i = 1, ... ,n. We denote by Z+ the set of all nonnegative integers. 

Let the space JF be endowed with an arbitrary norm || ■ ||. The dual space of JF, denoted by 
JF*, is the normed real vector space consisting of all linear functionals of s : JF ^ 9ft, endowed 
with the dual norm || ■ ||* defined as 

:= maxKs,^) : \\u\\ < 1}, Vs G JF*, 

u 

where {s,u) := s{u) is the value of the linear functional s at u. Finally, given an operator 
^ : JF ^ JF*, we define 

A[H,H] := {AH,H) 

for any G JF. 



2 Smooth optimization approach 

In this section, we consider a class of concave non-smooth maximization problems: 

ma.x{g{x) := mm(f){x,u)}, (2) 

where X and U are nonempty convex compact sets in finite-dimensional real vector spaces 
S and JF, respectively, and (j){x, u) : X x [/ — >■ 3ft is a continuous function which is strictly 
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concave in x G X for every fixed u G U, and convex differentiable in u E U for every fixed 
X E X. Therefore, for any u G U, the function 

f{u) := max (f){x,u) (3) 

is well-defined. We also easily conclude that f{u) is convex differentiable on U, and its gradient 
is given by 

V/(m) = V.0(x(m),m), WueU, (4) 

where x{u) denotes the unique solution of ([3]). 

Let the space JF be endowed with an arbitrary norm || ■ ||. We further assume that V/(m) 
is Lipschitz continuous on U with respect to || ■ ||, i.e., there exists some L > such that 

||V/(m) - V/(M)|r < L\\u-ul yu,uE U. 

Under the above assumptions, we easily observe that: i) problem ([21) and its dual, that is, 

min{/(M) : uEU}, (5) 

u 

are both solvable and have the same optimal value; and ii) the dual problem ([5]) can be suitably 
solved by Nesterov's smooth minimization approach [T9| [2T]. 

Denote by d{u) a prox-function of the set U . We assume that d{u) is continuous and 
strongly convex on U with modulus a > 0. Let Uq be the center of the set U defined as 

Mo = argmin{(i(-u) : u eU}. (6) 

Without loss of generality assume that d{uQ) = 0. We now describe Nesterov's smooth min- 
imization approach [191 121] for solving the dual problem , and we will show that it simul- 
taneously solves the non-smooth concave maximization problem 

Smooth Minimization Algorithm: 

Let Mo G f/ be given in Set x_i = and k = 0. 

1) Compute V/(Mfc) and x{uk). Set Xk = ^Xk-i + ^x{uk). 

2) Find G Argmin {(V/(Mfc),M - Mfc) + I ||m - Mfelp : uEU). 

3) FindM^^ = argmin|^rf(M) + |:^[/(M,) + (V/(M,),M-M,)] : w G f/|. 

4) Set Uk^, = ^^ul^ + f±i«r- 

5) Set k k + 1 and go to step 1). 
end 

The following property of the above algorithm is established in Theorem 2 of Nesterov 

ED. 
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Theorem 2.1 Let the sequence {{uk, uI^)}'^^q C U xU be generated by the Smooth Minimiza- 
tion Algorithm. Then for any k > we have 

^ ^ f{uf) < mill I ^d{u) + g ^[/(^.) + (V/(^.), u - u,)] : « G f/| . (7) 

We are ready to establish the main convergence result of the Smooth Minimization Al- 
gorithm for solving the non-smooth concave maximization problem ([2]) and its dual Its 
proof is a generalization of the one given in a more special context in [21j . 

Theorem 2.2 After k iterations, the Smooth Minimization Algorithm generates a pair of 
approximate solutions iu'^^, Xk) to problem ^ and its dual respectively, which satisfy the 
following inequality: 

0<f{ul'')-g{x,)< 



cr{k + l){k + 2)' ' ' 

Thus if the termination criterion f{ul^) — g{xk) < e is applied, the iteration complexity of 
finding an e-optimal solution to problem ^ and its dual ^ by the Smooth Minimization 

Algorithm does not exceed 2^J^-, where 

D = max{d{u) : ueU}. (9) 
Proof In view of ([2D, (jlD and the notation x{u), we have 

f{Ui) + (V/(Ui), U-Ui) = (l){x{Ui),Ui) + (V„0(x(Ui), Mi), M - Ui). (10) 

Invoking the fact that the function ■) is convex on U for every fixed a; G X, we obtain 

(f>{x{Ui), Ui) + {Vu(p{x{Ui),Ui),U - Ui) < (f){x{Ui),u). (11) 

Notice that X-i = 0, and Xk = -^Xfe-i + -j^x{uk) for any k > 0, which imply 

k 



E2(« + 1 ) , X , ^ 



i=0 



{k + l){k + 2) 



Using ( fTOl) . ( ITTi) . ( |T2l) and the fact that the function (f){-,u) is concave on X for every fixed 
u E U, we have 

k k 

J2i^ + l)[f{ui) + {Vf{ui),u-Ui)] < Y,i' + l)<P{x{ui),u) 

i=0 1=0 

< ^{k + l){k + 2)<p{xk,u) 
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for all u eU . It follows from this relation, ([7]), ([9]) and ([2]) that 




[/(n,) + (V/(n,),n-M,)] : ueU 



ALD 



+ mm(f){xk,u) 



ALD 



< 



a{k + l){k + 2) 



a{k + l){k + 2) 



and hence the inequality ([8]) holds. The remaining conclusion directly follows from ([8]). ■ 

Remark. We shall mention that Nesterov [20] developed the excessive gap technique for 
solving problem ([2]) and its dual ([5]) in a special context, which enjoys the same iteration 
complexity as the Smooth Minimization Algorithm described above. In addition, it is not 
hard to observe that the technique proposed in [20] can be extended to solve problem ([2]) and 
its dual in the aforementioned general framework provided that the subproblem 



can be suitably solved for any given > and x G X. The computation of each iterate 
of Nesterov's excessive gap technique [20] is similar to that of the Smooth Minimization 
Algorithm except that the former method requires solving a prox subproblem in the form of 
( |T3|) . but the latter one needs to solve the prox subproblem described in step 3) above. When 
the function ■) is affine for every fixed x G X, these two prox subproblems have the same 
form, and thus the computational cost of Nesterov's excessive gap technique [20] is almost 
same as that of the Smooth Minimization Algorithm; however, for a more general function 
■), the computational cost of the former method can be more expensive than that of the 
latter method. ■ 

The following results will be used to develop a variant of the Smooth Minimization Algo- 
rithm for sparse covariance selection in Subsection 13. 4[ 

Lemma 2.3 Problem ^ has a unique optimal solution, denoted by x* . Moreover, for any 
u* G Argmin{/(M) : u E U} , we have 



Proof. We clearly know that problem has an optimal solution. To prove its uniqueness, 
it suffices to show that g{x) is strictly concave on X. Indeed, since X xU is a. convex compact 
set and (j){x,u) is continuous on X x U, it follows that for any t G (0, 1), 7^ G X, there 
exists some u E U such that 



Recall that (/)(■, u) is strictly concave on X for every fixed u E U. Therefore, we have 
(f){tx^ + {1 -t)x'^,u) > t(j){x^,u) + {1 -t)(f){x^,u), 

> t min0(x"'^, m) + (1 — t) min 0(x^, m). 



min (f){x, u) + ^d{u) 



(13) 



X* = argmax0(x, M 



(14) 



4>{tx^ + (1 - t)x^, u) = min 0(te^ + (1 - t)x^, u) . 
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which together with ([2]) imphes that 

g{tx' + (1 - t)x^) > tg{x^) + (1 - t)g{x^) 

for any t G (0, 1), a;^ 7^ G X, and hence, g{x) is strictly concave on X as desired. 

Note that x* is the optimal solution of problem ([2]). We clearly know that for any u* G 
Argmin{/(u) : u G U}, {u*,x*) is a saddle point for problem ([2]), that is, 

and hence, we have 

X* G Argmax0(x, M*). 

It together with the fact that </){■, u*) is strictly concave on X, immediately yields dHl). ■ 

Theorem 2.4 Lei x* he the unique optimal solution of and f* be the optimal value of 
problems ^ and Assume that the sequences {ufcj^Q and {x(ufc)}^Q are generated by 
the Smooth Minimization Algorithm. Then the following statements hold: 

1) f{uk) f* , x{uk) X* as k ^ 00; 

2) f{uk) - g{x{uk)) ^0 as k ^ 00. 

Proof. Recall from the Smooth Minimization Algorithm that 

Uk+i = {2ul' + {k + l)uf) /{k + 3), VA; > 0. 

Since u^jfiU^ G U for VA; > 0, and [/ is a compact set, we have Uk+i — u'^jf ^ as A; ^ 00. 
Notice that f{u) is continuous on the compact set f/, and hence, it is uniformly continuous 
on U . Then we further have /(^X/t+i) — f{Wk^) —>■ as k —>■ 00. Also, it follows from Theorem 
12.21 that f{u^k) — > /* as A; 00. Therefore, we conclude that f{uk) /* as A; — > 00. 

Note that X is a compact set, and x{uk) C X for VA; > 0. To prove that x{uk) —>■ x* 
as A; — > 00, it suffices to show that every convergent subsequence of {a;(ufc)}^o converges to 
X* as A; ^ 00. Indeed, assume that {x{unJ}'kLo is an arbitrary convergent subsequence, and 
X* as k ^ 00 for some x* E X. Without loss generality, assume that the sequence 
{unk}T=o ^ M* as A; — > 00 for some u* E U (otherwise, one can consider any convergent 
subsequence of {un^}'^^^). Using the result that f{uk) ^ /*, we obtain that 

(x(m„J, M„J = /(m„J /*, as A; 00. 

Upon letting A; ^ 00 and using the continuity of 4>{-,-), we have (f){x*,u*) = f{u*) = f*. 
Hence, it follows that 

u* G Argmin/(u), x* = argmax0(x,'U*), 

which together with Lemma 12.31 implies that x* = x*. Hence as desired, x{un^) — > x* as 
A; — > 00. 

As shown in Lemma [2.31 the function g{x) is continuous on X. This result together with 
statement 1) immediately implies that statement 2) holds. ■ 
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3 Sparse covariance selection 



In this section, we discuss the apphcation of the smooth optimization approach proposed 
in Section [2] to sparse covariance selection. More specifically, we briefiy introduce sparse 
covariance selection in Subsection I3.H and show that it can be approximately solved as a 
/i-norm penalized maximum likelihood estimation problem in Subsection I3.2[ In Subsection 
13. 3[ We address some implementation details of the smooth optimization approach for solving 
this problem, and propose a variant of this approach in Subsection 13.41 



3.1 Introduction of sparse covariance selection 

In this subsection, we briefiy introduce sparse covariance selection. For more details, see 
d'Aspremont et al. [D] and the references therein. 

Given n variables with a Gaussian distribution A/'(0, C) for which the true covariance 
matrix C is unknown, we are interested in estimating C from a sample covariance matrix S by 
maximizing its likelihood while imposing a certain number of components in the inverse of the 
estimation of C to zero. This problem is commonly known as sparse covariance selection (see 
Since zeros in the inverse of covariance matrix correspond to conditional independence 
in the model, sparse covariance selection can be used to determine a robust estimate of the 
covariance matrix, and simultaneously discover the sparse structure in the underlying graphical 
model. 

Several approaches have been proposed for sparse covariance selection in literature. For 
example, Bilmes [4] proposed a method based on choosing statistical dependencies according 
to conditional mutual information computed from training data. The recent works [iHl [T3] 
involve identifying the Gaussian graphical models that are best supported by the data and 
any available prior information on the covariance matrix. Given a sample covariance matrix 
S G iS", d'Aspremont et al. [9] recently formulated sparse covariance selection as the following 
estimation problem: 

max log det X - (S, X) - pCard(X) 

X ^ : ' ' ^ ^ (15) 

s.t. a/ ^ X ^ /?/, 

where p > is a parameter controlling the trade-off between likelihood and cardinality, and 
< a < f3 < oo are the fixed bounds on the eigenvalues of the solution. For some specific 
choices of p, the formulation (ITBI) has been used for model selection in [U E] , and applied to 
speech recognition and gene network analysis (see |ll|T2]). 

Note that the estimation problem f|T5|) itself is a NP-hard combinatorial problem because 
of the penalty term Card(X). To overcome the computational difficulty, d'Aspremont et al. 
[9] used an argument that is often used in regression techniques (e.g., see |23l EJ [H] ) , where 
sparsity of the solution is concerned, to relax Card(X) to e^|X|e, and obtained the following 
/i-norm penalized maximum likelihood estimation problem: 



max logdetX — (S,X) — pe |X|e 
s.t. al ^X ^ (31, 



(16) 
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Recently, Yuan and Lin [25] proposed a similar estimation problem for sparse covariance 

selection given as follows: 

max log det X — (S, X) — p ^ | 

^ i^] (17) 

s.t. dil ^ pi, 

with a = and (3 = oo. They showed that problem f|T7j) can be suitably solved by the interior 
point algorithm developed in Vandenberghe et al. [21]. A few other approaches have also been 
studied for sparse covariance selection by solving some related maximum likelihood estima- 
tion problems in literature. For example, Huang et al. [17J proposed an iterative (heuristic) 
algorithm to minimize a nonconvex penalized likelihood. Dahl et al. [HI E] applied Newton 
method, coordinate steepest descent method, and conjugate gradient method for the problems 
for which the conditional independence structure is partially known. 

As shown in d'Aspremont et al. [9] (see also [3]), and Yuan and Lin [25], the /i-norm 
penalized maximum likelihood estimation problems ( JT6l) and (ITTll are capable of discovering 
effectively the sparse structure, or equivalently, the conditional independence in the underlying 
graphical model. Also, it is not hard to see that the estimation problem (|T71) becomes a special 
case of problem f|T6l) if replacing S by S + pJ in f|T71) . For these reasons, we focus on problem 
f|T6|) only for the remaining paper. 

3.2 Non-smooth strictly concave maximization reformulation 

In this subsection, we show that problem (|T6|) can be reformulated as a non-smooth strictly 
concave maximization problem of the form ([2]). 

Recall from Subsection 13.11 that S G iS", and keep in mind that the notations | ■ |, || ■ || 
and II • ||i? are defined in Subsection ll.il We first provide some tighter bounds on the optimal 
solution of problem (fT6|) for the case where a = and (3 = oo. 

Proposition 3.1 Assume that a = and 13 = oo. Let X* G iS"_|_ he the unique optimal 
solution of problem / fi^) . Then we have al < X* < [31 , where 




(18) 



with 



V 



min|e"^|S ^\e, {n — p^/na)\\T, '^\\ — 
2e^|(S + f/)-i|e-Tr((S + f/)-i), 



{n — l)a;} , ifTiis invertible; 



otherwise. 



Proof. Let 



W:={f/G5": |f/,,| <l,Vu}, 



(19) 



and 



L{X, U) = log det X - (S + pU, X) , V(X, U) G 5^+ x U. 



(20) 
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Note that X* G 5"^ is the optimal solution of problem (fT6|) . It can be easily shown that there 
exists some U* eU such that (X*, U*) is a saddle point of ■) on 5"^ x U, that is, 

X* = arg min L(X,U*), U* G Argmin L(X*, f/). 



The above relations along with f[T9|) and fl20|) immediately yield 

X*(S + pt/*) = /, {X*,U*) = e^\X*\e. (21) 

Hence, we have 

X* = + pU*)-^ y ^ 



+p||f/*|| ' 

which together with (fT9|) and the fact U* G W, implies that X* >z ^p^-^- Thus as desired. 



X* >z ctl, where a is given in f[T8|) . 

We next bound X* from above. In view of fl2T]) . we have 

(X*,S) + pe^|X*|e = n, (22) 

which together with the relation X* >z al implies that 

e^\X*\e < ^ (23) 

Now let X{t) := (S + tpl)~^ for t G (0, 1). By concavity of logdet(-), one can easily see that 
X{t) maximizes the function logdet(-) — {H + tpI, ■) over 5"^. Using this observation and the 
definition of X*, we can have 

logdetX* - (S + tp/,X*) < logdetX(t) - (S + tp/,X(t)), 
logdetX(t) - (S,X(t)) -pe^|X(t)|e < logdetX* - (S,X*) - pe^|X*|e. 

Adding the above two inequalities upon some algebraic simplification, we obtain that 

e^|X*|e - tTr(X*) < e^\X{t)\e - tTr(X(t)), 



r,^*u/e^W)|e-tTr(X(t)) 



e'\X*\e< ' ^^'^ — ^ ^ ' VtG(0, 1). (24) 



and hence. 



If E is invertible, upon letting t | on both sides of (1241) . we have 

e^\X*\e < e^|S->. 
Otherwise, letting t = 1/2 in (!24l) . we obtain 

e^\X*\e < 2e^|(S + ^I)-'\e - Tr((S + ^I)-'). 
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Combining the above two inequalities and (l23l) . we have 

\\X*\\ < ||X*||^<e^|X*|e<min|^^^^^,7^, (25) 

where 

e'^|S~^|e, if S is invertible; 

2e^|(E + f/)-^ - Tr((S + f/)"^), otherwise. 
Further, using the relation X* >z al, we obtain that 
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e^\X*\e > \\X*\\f > V^a, 
which together with fl22l) implies that 

Tr(X*S) <n- p^a. 
This inequality along with the relation X* >z al yields 

A^in(S)((n - l)a + ||X*||) < Tr(X*S) <n- p^a. 
Hence if S is invertible, we further have 

||X*|| <{n- pv^a)||S-^|| - (n - l)a. 

This together with (!25|) implies that X* ^ j3I, where (3 is given in (fT8|) . ■ 

Remark. Some bounds on X* were also derived in d'Aspremont et al. [9j. In contrast 
with their bounds, our bounds given in (fTSll are tighter. Moreover, our approach for deriving 
the above bounds can be generalized to handle the case where <5 > and /? = oo, but their 
approach cannot. Indeed, if a > and /5 = oo, we can set a = a, and replace the above X{t) 
by the optimal solution of 

max logdetX - (S + p/,X) 
s.t. al < X, 

which has a closed-form expression. By following a similar derivation as above, one can obtain 
a positive scalar /3 such that X* ^ (31. In addition, for the case where a = and < /3 < oo, 
one can set (3 = P and easily show that X* > al, where a = pe~^^'^'^^^^~^"'^\ m 

From the above discussion, we conclude that problem (fT6|) is equivalent to the following 
problem: 

max logdetX — (E,X) — pe^lXle 
X ^ \ . / I I .26) 

s.t. al^X^pi, 

for some < a < /3 < oo. 



11 



We further observe that problem (126!) can be rewritten as 



max min log det X - (S + pf/, X) , (27) 

X&X U&A 

where U is defined in ( IT9l) . and X is defined as follows: 

X ■={X eS"" ■ al^X^ (31}. (28) 



Therefore, we conclude that problem (1161) is equivalent to (!27l) . For the remaining paper, we 
will focus on problem ( 1271) only. 



3.3 Smooth optimization method for sparse covariance selection 

In this subsection, we describe the implementation details of the Smooth Minimization Al- 
gorithm proposed in Section [2] for solving problem ( |271) . We also compare the complexity 
of this algorithm with interior point methods, and two other first-order methods studied in 
d'Aspremont et al. [9j, that is, Nesterov's smooth approximation scheme and block coordinate 
descent method. 

We first observe that the sets X and U both lie in the space 5", where X and U are 
defined in ( l28l) and ( |T9l) . respectively. Let iS" be endowed with the Frobenius norm, and let 
d{X) = log det X for X ^ X. Then for any X G A", we have 

V^d{X)[H,H] = -Ti{X-^HX-^H) < -(3-^\\H\\l 

for all H G iS", and hence, d{X) is strongly concave on X with modulus Using this 

result and Theorem 1 of [21], we immediately conclude that V/(f/) is Lipschitz continuous 
with constant L = p^jS'^ on U, where 

/(?7) :=max logdetX- (S + p[/,X), Vt/ G W. (29) 

Denote the unique optimal solution of problem ( l29l) by X{U). For any U EU,we can compute 
X{U), f{U) and V/(f/) as follows. 

Let S + pU = Qdiag{'j)Q^ be an eigenvalue decomposition of S + pU such that QQ^ = I. 
For i = 1, . . . , n, let 

f min{max{l/7j, «},/?}, if 7^ > 0; 
* 1^ /?, otherwise. 

It is not hard to show that 

n 

X([/) = Qdiag(A)Q^, /(f/) = -7^A + 5^1ogA„ V f{U) = -pX{U). (30) 

i=l 

From the above discussion, we see that problem ( 1271) has exactly the same form as ([2]), and 
also satisfies all assumptions imposed on problem ([2]). Therefore, it can be suitably solved by 
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the Smooth Minimization Algorithm proposed in Section [2l The implementation details of 
this algorithm for problem (!27j) are described as follows. 

Given Uq G U, let d{U) = \\U — Uo\\p/2 be the proximal function on U, which is strongly 
convex function with modulus a = 1. For our specific choice of the norm and d{U), we clearly 
see that steps 2) and 3) of the Smooth Minimization Algorithm can be solved as a problem 
of the form 

\/ = argmin(G,f/) + ||f/||y2 

for some G G 5". In view of ffTI?]) . we see that 

Vij = max{mm{-Gij, 1}, -1}, i,j = l,...,n. 

In addition, for any X G A", we define 

g{X) :=logdetX- (E,X) -pe^|X|e. (31) 

For the ease of comparison with its latter variant, we now present a complete version of 
the aforementioned Smooth Minimization Algorithm for solving problem (127|) and its dual. 

Smooth Minimization Algorithm for Covariance Selection (SMACS): 

Let e > and Uq eU he given. Set X_i = 0, L = cr = 1, and k = 0. 

1) Compute VfiUk) and X{Uk). Set X^ = ^Xfc_i + ^,X{Uk). 

2) Find Uf = argmin {(V/(f/fc), t/ - f/,) + f \\U - U^fp : UeU}. 

3) FindC/fc"^ = argmin|^||f/-f/o||^ + i:^[/(f/,) + (V/(t/,),t/-f/,)]: f/Gw|. 

4) Set = 43[/r + muf- 

5) Set k^k + 1. Go to step 1) until f{Uf) - g{Xk) < e. 
end 

The iteration complexity of the above algorithm for solving problem fl27|) is established in 
the following theorem. 

Theorem 3.2 The iteration complexity performed by the algorithm SMACS for finding an 
e-optimal solution to problem (22) '^'^d its dual does not exceed \/2p/3 max ||[/ — f/o||i?/A/e, and 



moreover, if Uq = 0, it does not exceed ^J2p(5nl 

Proof. From the above discussion, we know that L = p'^jS'^, D = max||f/ — f/o|lF/2 
and (7=1, which together with Theorem 12.21 immediately implies that the first part of the 
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statement holds. Further, if Uq = 0, we easily obtain from (fT9l) that D = max ||f/|||./2 = ?2^/2. 
The second part of the statement directly follows from this result and Theorem I2.2[ ■ 

Remark. By the definition of U (see f fT9|) ). we can easily show that min max \\U — Uo\\f 

has a unique minimizer ?7o = 0. This result together with Theorem 13.21 implies that the initial 
point [/q = gives the optimal worst-case iteration complexity for the algorithm SMACS. ■ 

Alternatively, d'Aspremont et al. [9] applied Nesterov's smooth approximation scheme [2T] 
to solve problem (|27|) . More specifically, let e > be the desired accuracy, and let 

d{U) = \\U\\l/2, D = ^axd{U) = nV2. 

As shown in [21], the non-smooth function g{X) defined in (13T|) is uniformly approximated by 
the smooth function 

g,{X) = min logdet X - (S + pU, X) - -^d{U) 

on X with the error at most by e/2, and moreover, the function ge{X) has a Lipschitz contin- 
uous gradient on X with some constant L(e) > 0. Nesterov's smooth optimization technique 
[T9l l2T] is then applied to solve the perturbed problem max(7e(X), and problem ( |27j) is ac- 
cordingly solved. It was shown in [9j that the iteration complexity of this approach for finding 
an e-optimal solution to problem (!27j) does not exceed 



-e + "V^ ^^'^ 

where k:= 13/a. 

In view of ( 132|) and Theorem 13. 2[ we conclude that the smooth optimization approach 
improves upon Nesterov's smooth approximation scheme at least by a factor 0{y/n\ogKl y/e) 
in terms of the iteration complexity for solving problem ( 127|) . Moreover, the computational 
cost per iteration of the former approach is at least as cheap as that of the latter one. 

d'Aspremont et al. [9] also studied a block-coordinate descent method for solving problem 
(fT6|) with (5 = and /3 = oo. Each iterate of this method requires computing the inverse 
of an (n — 1) X (n — 1) matrix, and solving a box constrained quadratic programming with 
n — 1 variables. As mentioned in Section 3 of [9] , this method has a local linear convergence 
rate. However, its global iteration complexity for finding an e-optimal solution is theoretically 
unknown. Moreover, this method is not suitable for solving problem f|T6|) with a > or 
/3 < oo. 

In addition, we observe that problem ( l26l) (also ( fT6l) ) can be reformulated as a con- 
strained smooth convex problem that has an explicit (9 (n^) -logarithmically homogeneous self- 
concordant barrier function. Thus, it can be suitably solved by interior point (IP) methods 
(see Nesterov and Nemirovski [22] and Vandenberghe et al. p^). The worst-case iteration 
complexity of IP methods for finding an e-optimal solution to fl26|) is O(nlog(eo/e)), where eo 
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is an initial gap. Each iterate of IP methods requires 0{n^) arithmetic cost for assembhng 
and solving a typically dense Newton system with 0{n^) variables. Thus, the total worst-case 
arithmetic cost of IP methods for finding an e-optimal solution to (!26l) is (9(n^ log(eo/e)). In 
contrast with IP methods, the algorithm SMACS requires 0{n^) arithmetic cost per iteration 
dominated by eigenvalue decomposition and matrix multiplication of n x n matrices. Based 
on this observation and Theorem 13.21 we conclude that the overall worst-case arithmetic cost 
of the algorithm SMACS for finding an e-optimal solution to fl26l) is 0{p|3n^/^/e), which is 
substantially superior to that of IP methods, provided that pjS is not too large and e is not 
too small. 



3.4 Variant of Smooth Minimization Algorithm 

As discussed in Subsection 13. 3[ the algorithm SMACS has a nice theoretical complexity in 
contrast with IP methods, Nesterov's smooth approximation scheme, and block-coordinate 
descent method. However, its practical performance is still not much attractive (see Section 
H]). To enhance the computational performance, we propose a variant of the algorithm SMACS 
for solving problem (1271) in this subsection. 

Our first concern of the algorithm SMACS is that the eigenvalue decomposition of two 
n X n matrices is required per iteration. Indeed, the eigenvalue decomposition of S -|- pUk 
and S -|- pU^'^ is needed at steps 1) and 5) to compute V f{Uk) and f{U^'^), respectively. We 
also know that the eigenvalue decomposition is one of major computations for the algorithm 
SMACS. To reduce the computational cost, we now propose a new termination criterion other 
than f{U^'^) - g{Xk) < e that is used in the algorithm SMACS. In view of Theorem 12.41 we 
know that 

/([/,) -(?(X(f/,))^0, asfc^oo. 

Thus, f{Uk) — g{X{Uk)) < e can be used as an alternative termination criterion. Moreover, it 
follows from (l30l) that the quantity f{Uk) — g{X{Uk)) is readily available in step 1) of the algo- 
rithm SMACS with almost no additional cost. We easily see that the algorithm SMACS with 
this new termination criterion would require only one eigenvalue decomposition per iteration. 
Despite this clear advantage, we shall mention that the iteration complexity of the resulting 
algorithm is unfortunately unknown. Nevertheless, in practice we have found that the number 
of iterations performed by the algorithm SMACS with the above two different termination 
criteria are almost same. Thus, f{uk) — g{x{uk)) < e is a useful practical termination criterion. 

For sparse covariance selection, the penalty parameter p is usually small, but the parameter 
(3 can be fairly large. In view of Theorem 13.21 we know that the iteration complexity of the 
algorithm SMACS for solving problem (1271) is proportional to (3. Therefore, when (3 is too 
large, the complexity and practical performance of this algorithm become unattractive. To 
overcome this drawback, we will propose one strategy to dynamically update (3. 

Let X* be the unique optimal solution of problem (l27j) . For any [3 G [Xma.x{X*), (3], we 
easily observe that X* is also the unique optimal solution to the following problem: 

(P^) maxmin logdetX-(S + p[/,X), (33) 
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where U is defined in ( !T9!l . and is given by 

X^:={X: al^X^ (31}. 

In view of Theorem 13.21 the iteration complexity of the algorithm SMACS for problem fl33l) 
is lower than that for problem fl27|) provided [3 G [X^s,x{X*), (3). Hence ideally, we set /3 = 
Amax(-^*), which would give the lowest iteration complexity, but unfortunately, Amax(-^*) 
is unknown. However, we can generate a sequence {/3a:}^o ^^^^ asymptotically approaches 
Amax(-^*) as the algorithm progresses. Indeed, in view of Theorem 12. 41 we know that X{Uk) — > 
X* as k oo, and we obtain that 

Amax(^(f^fc)) Amax(^*), aS A; ^ OO. 

Therefore, we see that {Amax(-^(f^fc))}fclo '^^^ t)e used to generate a sequence {/3fc}^Q that 
asymptotically approaches Amax(Ar*). We next propose a strategy to generate such a sequence 

For convenience of presentation, we introduce some new notations. Given any U & U and 
(3 G [a, we define 

Xp{U) := argmax logdetX - (S + pf/,X), (34) 
f^iU) := max logdetX- (S + pt/,X). (35) 

Definition 1 Given any U and (3 G Xp{U) is called "active" if Xmax{Xp{U)) = f3 

and (3 < (3; otherwise it is called "inactive" . 

Let <ji, <,2 > 1, and ^3 G (0, 1) be given and fixed. Assume that Uk &U and Pk £ are 
given at the beginning of the kth iteration for some k > 0. We now describe the strategy for 
generating the next iterate Uk+i and Pk+i by considering the following three different cases: 

1) If Xp^{Uk) is active, find the smallest s G 2+ such that X^{Uk) is inactive, where 

J3 = mm{<;l/3k, P}- Set Pk+i = (3, and apply the algorithm SMACS for problem {Pp^_^^) 
starting with the point Uk and set its next iterate to be Uk+i- 

2) If X^^{Uk) is inactive and Amax(Ar^^(f^fc)) < 'isPk, set pk+i = max{min{^2Amax(^4(t^fc)), 
j3},a}. Apply the algorithm SMACS for problem (-Pg^^J starting with the point Uk, 
and set its next iterate to be Uk+i- 

3) If X0^{Uk) is inactive and Xmax{Xfj^{Uk)) > 'izPk, set (3k+i = Pk- Continue the algorithm 
SMACS for problem (-Pg^.), and set its next iterate to be Uk+i- 

For the sequences {f/^j^Q and {(3k]'k=Q recursively generated above, we observe that the 
sequence {X^^^^(f/fc)}^Q is always inactive. This together with (|35|) . (|29|) and the fact 

that /3k < (3 for k > 0, implies that 

fiUk)=f^^jUk), Vf{Uk) = Vf^^jUk), yk>0. (36) 
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Therefore, the new termination criterion f{Uk) — g{X{Uk)) < e can be replaced by 

fkjU.)-g{X^^jUu))<e (37) 

accordingly. 

We now incorporate into the algorithm SMACS the new termination criterion (137|) and 
the aforementioned strategy for generating a sequence {[3k}'kLo that asymptotically approaches 
Amax(-^*), and obtain a variant of the algorithm SMACS for solving problem (1271) . For con- 
venience of presentation, we omit the subscript k from (3k- 

Variant of Smooth Minimization Algorithm for Covariance Selection (VSMACS): 

Let e > 0, iji, > 1, and <J3 G (0, 1) be given. Choose a Uq ElA. Set (5 = (3, L = p'^P'^, cr = 1, 
and k = 0. 

1) Compute X^(f/fe) according to fl30l) . 

la) If Xp{Uk) is active, find the smallest s G 2+ such that Xp{Uk) is inactive, where 
j3 = mm{ql/3, /3}. Set k = 0, Uq = Uk, /3 = (3, L = p'^(3^, and go to step 2). 

lb) If X^{Uk) is inactive and Amax(Ar^(f^fc)) < ^sA set k = 0, Uq = Uk, 
(3 = max{min{<;2Amax(Ar^(f^fe)),/5},a}, and L = p'^(3'^. 

2) If fp{Uk) — g{Xj^{Uk)) < e, terminate. Otherwise, compute Vfj^iUk) according to ( l30i) . 

3) Find C/f = argmin |(V/^([/fc), U - Uk) + ^ \\U - Uk\\l ■ G w}. 

4) Findt/,"^ = argmin|^||t/-f/o|||. + E^[/^(f/.) + (V/^(f/,),f/-t/i)] : t/ G w|. 

5) Set Uk^, = ^3^/^ + mU^'- 

6) Set /c /c + 1, and go to step 1). 
end 

We next establish some preliminary convergence properties of the above algorithm. 

Proposition 3.3 For the algorithm VSMACS, the following properties hold: 

1) Suppose that the algorithm VSMACS terminates at some iterate {Xg{Uk),Uk) ■ Then 



(X^(t/fc), Uk) is an e-optimal solution to problem [27\ ) and its dual. 



2) Suppose that (3 is updated only for a finite number of times. Then the algorithm VSMACS 
terminates in a finite number of iterations, and produces an e-optimal solution to problem 
^^2l\j and its dual. 
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Proof. For the final iterate {X^{Uk), Uk), we clearly know that fpiJJk) —g{Xj^{Uk)) < e, and 
Xj^iUk) is inactive. As shown in (!36|) . f{Uk) = ffjiUk)- Hence, we have f{Uk) — g{X^{Uk)) < 
e. We also know that Uk G U, and X^{Uk) G A" due to /3 G Thus, statement 1) 

immediately follows. After the last update of (3, the algorithm VSMACS behaves exactly like 
the algorithm SMACS as applied to solve problem (Pg) except with the termination criterion 
f{Uk) — g{Xp{Uk)) < e. Thus, it follows from statement 1) and Theorem 12.41 that statement 
2) holds. ■ 

4 Computational results 

In this section, we compare the performance of the smooth minimization approach and its 
variant proposed in this paper with other first-order methods studied in |9l |T5], that is, 
Nesterov's smooth approximation scheme and block coordinate descent method for solving 
problem f|T6|) (or equivalently, fl27|) ) on a set of randomly generated instances. 

All instances used in this section were randomly generated in the same manner as described 
in d'Aspremont et al. [9J. First, we generate a sparse invertible matrix A E with positive 
diagonal entries and a density prescribed by q. We then generate the matrix 5 G 5" by 

B = A-^ + rV, 

where \^ G 5" is an independent and identically distributed uniform random matrix, and 
r is a small positive number. Finally, we obtain the following randomly generated sample 
covariance matrix: 

S = 5-min{A^in(5)-^,0}/, 

where is a small positive number. In particular, we set q = 0.01, r = 0.15 and i) = l.Oe — 4 
for generating all instances. 

As discussed in Section [3^ our smooth minimization approach has much better worst-case 
iteration complexity than Nesterov's smooth approximation scheme studied in d'Aspremont 
et al. [_9j for problem fl7r|) . However, it is unknown how their practical performance differs 
from each other. In the first experiment, we compare the practical performance of our smooth 
minimization approach and its variant with Nesterov's smooth approximation scheme studied 
in d'Aspremont et al. ^ for problem (!27|) with a = 0.1, (3 = 10 and p = 0.5. For convenience 
of presentation, we label these three first-order methods as SM, VSM, and NSA, respectively. 
The codes for them are written in Matlab. More specifically, the code for NSA follows the 
algorithm presented in d'Aspremont et al. [9], and the codes for SM and VSM are written 
in accordance with the algorithms SMACS and VSMACS, respectively. Moreover, we set 
"^1 = ^2 = 1-05 and ^3 = 0.95 for the algorithm VSMACS. These three methods terminate once 
the duality gap is less than e = 0.1. All computations are performed on an Intel Xeon 2.66 
GHz machine with Red Hat Linux version 8. 

The performance of the methods NSA, SM and VSM for the randomly generated instances 
are presented in Table [H The row size n of each sample covariance matrix S is given in col- 
umn one. The numbers of iterations of NSA, SM and VSM are given in columns two to four. 
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Table 1: Comparison of NSA, SM and VSM 



Problem 




Iter 






Obj 






Time 




n 


NSA 


SM 


VSM 


NSA 


SM 


VSM 


NSA 


SM 


VSM 


50 


3657 


457 


20 


-76.399 


-76.399 


-76.393 


49.0 


2.7 


0.1 


100 


7629 


920 


27 


-186.717 


-186.720 


-186.714 


900.4 


38.4 


0.4 


150 


20358 


1455 


49 


-318.195 


-318.194 


-318.184 


8165.7 


188.8 


2.0 


200 


27499 


2294 


102 


-511.246 


-511.245 


-511.242 


26172.5 


698.8 


9.2 


250 


45122 


3060 


128 


-3793.255 


-3793.256 


-3793.257 


87298.9 


1767.9 


19.8 


300 


54734 


3881 


161 


-3187.163 


-3187.171 


-3187.172 


184798.1 


3994.0 


45.5 


350 


64641 


4634 


182 


-2756.717 


-2756.734 


-2756.734 


351460.7 


7613.9 


83.6 


400 


74839 


5308 


176 


-3490.640 


-3490.667 


-3490.667 


614237.1 


13536.7 


116.9 



and the objective function values are given in columns five to seven, and the CPU times (in 
seconds) are given in the last three columns, respectively. From Table [H we conclude that: 
i) the method SM, namely, the smooth minimization approach, outperforms substantially the 
method NSA, that is, Nesterov's smooth approximation scheme; and ii) the method VSM, 
namely, the variant of the smooth minimization approach, substantially outperforms the other 
two methods. In addition, we see from this experiment that Nesterov's smooth minimization 
approach [19] is generally more appealing than his smooth approximation scheme [21] when- 
ever the problem can be solved as an equivalent smooth problem. Nevertheless, we shall 
mention that the latter approach has much wider field of application (e.g., see [Zlj), where 
the former approach cannot be applied. 

From the above experiment, we have already seen that the method VSM outperforms 
substantially two other first-order methods, namely, SM and NSA for solving problem ( 1271) . 
In the second experiment, we compare the performance of the method VSM with the block 
coordinate descent (BCD) methods studied in d'Aspremont et al. [9] and Friedman et al. 
[T5] on relatively large-scale instances. For convenience of presentation, we label these two 
methods as BCDl and BCD2, respectively. The method BCD2 was developed very recently 
and it is a slight variant of the method BCDl. In particular, each iterate of BCDl solves a 
box constrained quadratic programming by means of interior point methods, but each iterate 
of BCD2 applies a coordinate descent approach to solving a lasso (/i-regularized) least-squares 
problem, which is the dual of the box constrained quadratic programming appearing in BCDl. 
It is worth mentioning the methods BCDl and BCD2 are only applicable for solving problem 
( IT6l) with (5 = and (3 = oo. Thus, we only compare their performance with our method 
VSM for problem (fT6l) with such a and (3. As shown in Subsection 13.21 problem (fT6l) with 
(5 = and /5 = cxd is equivalent to problem fl271) with a and (3 given in f|T8l) . and hence it can 
be solved by applying the method VSM to the latter problem instead. 

The code for the method BCDl was written in Matlab by d'Aspremont and El Ghaoui [10] 
while the code for BCD2 was written in Fortran 90 by Friedman et al. [TH]. The methods BCDl 
and VSM terminate once the duality gap is less than e = 0.1. The original code [T6j for BCD2 
uses the average absolute change in the approximate solution as the termination criterion. In 
particular, the average absolute change in the approximate solution is evaluated at the end of 
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Table 2: Comparison of BCDl, BCD2 and VSM 



Problem 




Iter 






Obj 






Time 




n 


BCDl 


bcd2 


VSM 


BCDl 


bcd2 


VSM 


BCDl 


bcd2 


VSM 


100 


124 


200 


33 


-186.522 


-186.433 


-186.522 


22.3 


0.1 


0.5 


200 


531 


600 


109 


-449.210 


-449.179 


-449.209 


300.0 


1.3 


9.5 


300 


1530 


1500 


146 


-767.615 


-767.608 


-767.614 


2428.2 


80.9 


48.5 


400 


2259 


2400 


154 


-1082.679 


-1082.651 


-1082.677 


8402.4 


298.7 


112.3 


500 


3050 


3500 


154 


-1402.503 


-1402.457 


-1402.502 


22537.1 


640.2 


211.5 


600 


3705 


4200 


165 


-1728.628 


-1728.587 


-1728.627 


48950.4 


1215.0 


397.6 


700 


4492 


4900 


163 


-2057.894 


-2057.862 


-2057.892 


92052.7 


1972.5 


611.1 


800 


4958 


5600 


169 


-2392.713 


-2392.671 


-2392.712 


147778.9 


2872.3 


943.2 


900 


5697 


6300 


161 


-2711.874 


-2711.827 


-2711.874 


219644.3 


3593.7 


1268.5 


1000 


6536 


7000 


161 


-3045.808 


-3045.768 


-3045.808 


344687.8 


6098.7 


1710.0 



each cycle consisting of n block coordinate descent iterations, and their code terminates once 
it is below a given accuracy (see pp. 6 of [12] for details). According to our computational 
experience, we found with such a criterion, BCD2 is extremely hard to terminate for relatively 
large-scale instances (say n = 300) unless a maximum number of iterations is set. Obviously, 
it is not easy to choose a suitable maximum number of iterations for BCD2. Thus, to be as fair 
as possible to BCDl and VSM, we simply replace their termination criterion detailed in [16] 
for BCD2 by the one with the duality gap less than e = 0.1. In other words, the duality gap 
is computed at the end of each cycle consisting of n block coordinate descent iterations, and 
BCD2 terminates once it is below e = 0.1. It is worth remarking that the cost for computing 
a duality gap is 0{n^) since the inverse of an n x n symmetric matrix is needed. Thus, it is 
reasonable to compute duality gap once every n iterations rather than each iteration. 

All computations are performed on an Intel Xeon 2.66 GHz machine with Red Hat Linux 
version 8. The performance of the methods BCDl, BCD2 and VSM for the randomly gen- 
erated instances are presented in Table [2] The row size n of each sample covariance matrix 
S is given in column one. The numbers of iterations of BCDl, BCD2 and VSM are given 
in columns two to four, and the objective function values are given in columns five to eight, 
and the CPU times (in seconds) are given in the last three columns, respectively. From Table 
[21 we conclude both BCD2 and VSM substantially outperform BCDl. We also observe that 
our method VSM outperforms BCD2 for almost all instances except two relatively small-scale 
instances. 

In the above experimentation, we compared the performance of BCD2 and VSM for e = 0.1. 
We next compare their performance on the same instances as above and apply the same 
termination criterion as above except that we set up e = 0.01 and an upper bound of 20 hours 
computation time (or 72, 000 seconds) per instance for both codes. The performance of the 
methods BCD2 and VSM are presented in Table [31 The row size n of each sample covariance 
matrix S is given in column one. The numbers of iterations of BCD2 and VSM are given 
in columns two to three, and the objective function values are given in columns four to five, 
and the CPU times (in seconds) are given in the last two columns, respectively. It shall be 
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Table 3: Comparison of BCD2 and VSM 



Problem 


Iter 




Obj 


Time 


n 


bcd2 


VSM 


bcd2 


VSM 


bcd2 


VSM 


100 


200 


54 


-186.433 


-186.435 


0.1 


0.77 


200 


1200 


239 


-449.119 


-449.122 


2.1 


21.6 


300 


3000 


310 


-767.525 


-767.525 


32.1 


104.2 


400 


11778400 


321 


-1082.592 


-1082.589 


72000.0 


223.3 


500 


6997000 


309 


-1402.420 


-1402.413 


72001.0 


395.5 


600 


4637400 


318 


-1728.553 


-1728.538 


72004.0 


765.2 


700 


3215100 


310 


-2057.823 


-2057.804 


72005.0 


1330.0 


800 


2307200 


309 


-2392.644 


-2392.623 


72003.0 


1789.2 


900 


1846800 


289 


-2711.806 


-2711.784 


72024.0 


2394.0 


1000 


1257000 


283 


-3045.749 


-3045.718 


72051.0 


3115.8 



mentioned that BCD2 and VSM are both feasible methods, and moreover, f|T6|) and fl27|) are 
maximization problems. Therefore for these two methods, the larger objective function value, 
the better. From Table [31 we observe that up to accuracy e = 0.01, the method BCD2 cannot 
solve almost all instances within 20 hours except the first three relatively small-scale ones, but 
our method VSM does solve each of these instances in less than one hour and produces a better 
objective function values for almost all instances except the first three relatively small-scale 
ones. Also, it is interesting to observe that the number of iterations for VSM nearly doubles 
as the accuracy parameter e increases by one digit, which is even better than the theoretical 
estimate that is vTo according to Theorem I3.2[ 



5 Concluding remarks 

In this paper, we proposed a smooth optimization approach for solving a class of non-smooth 
strictly concave maximization problems. We also discussed the application of this approach 
to sparse covariance selection, and proposed a variant of this approach. The computational 
results showed that the variant of the smooth optimization approach outperforms substan- 
tially the latter one, and two other first-order methods studied in d'Aspremont et al. [9] and 
Friedman et al. [T3] . 

As discussed in Subsection 13. 3[ problem (127|) has the same form as ([2]), and satisfies all 
assumptions imposed on problem ([2]). Moreover, its associated objective function 0(X, f/) = 
logdetX — (S + pU,X) is affine with respect to U for every fixed X G 5"+. In view of 
these facts along with the remarks made in Section [21 one can observe that problem (1271) can 
be suitably solved by Nesterov's excessive gap technique [20j. Since the iterate complexity 
and the computational cost per iterate of this technique is same as those of the algorithm 
SMACS, we expect that the computational performance of these two methods for solving f[27j) 
are similar. It would be interesting to implement Nesterov's excessive gap technique pOj and 
its variant (that is, the one in a similar fashion to the algorithm VSMACS), and compare 
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their computational performance with SMACS and VSMACS, respectively. 

Though the variant of the smooth optimization approach outperforms substantially the 
smooth optimization approach, we are currently only able to establish some preliminary con- 
vergence properties for it. A possible direction leading to a thorough proof of its convergence 
would be to show that the updates on /3 in the algorithm VSMACS can occur only for a 
finite number of times. Given that VSMACS is a nonmonotone algorithm, it is, however, 
highly challenging to analyze the behavior of the sequences {Uk} and {X^(f/fc)} and hence 

the total number of updates on /3. Interestingly, we observed in our implementation that 
when P > Amax(-^*), the sequence {X^{Uk)} generated by the algorithm VSMACS satisfies 

^rnn^i^ (^{Uk)) G [Xraa.x{^*), P), where X* is the optimal solution of problem fl271) . Neverthe- 
less, it remains completely open whether this holds in general or not. In addition, the ideas 
used in the variant of the smooth optimization approach are interesting in their own right even 
when viewed as some heuristics. They could also be used to enhance the practical performance 
of Nesterov's first-order methods [191 EI] for solving some general min-max problems. 

The codes for the variant of the smooth minimization approach are written in Matlab and 
C, which are available online at www.math.sfu.ca/~zhaosong. The C code for this method 
can solve large-scale problems more efficiently provided that LAPACK package is suitably 
installed. We will plan to extend these codes for solving more general problems of the form 

max log det X — (S, X) — ^ LJij\Xij \ 
s.t. al ^X ^ pi, 

X,, = 0, y{t,3)en, 

for some set Q, where Uij = ojji > for all z, j = 1, . . . , n, and < a < /3 < cxd are some fixed 
bounds on the eigenvalues of the solution. 
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